Skip to content

feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896

Open
tanishqgandhi1908 wants to merge 11 commits into
apache:mainfrom
tanishqgandhi1908:feat/count-star
Open

feat(workflow-operator): support COUNT(*) in the Aggregate operator#5896
tanishqgandhi1908 wants to merge 11 commits into
apache:mainfrom
tanishqgandhi1908:feat/count-star

Conversation

@tanishqgandhi1908

@tanishqgandhi1908 tanishqgandhi1908 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

This PR adds COUNT(*) support to the Aggregate operator, so users can count all rows without selecting a column — useful when there's no natural column to count.

The existing count function now handles both cases via an optional attribute:

  • count with an empty AttributeCOUNT(*) — counts every row (including rows with nulls)
  • count with a column → counts that column's non-null values (unchanged)

The Attribute stays required for every other function (sum, average, min,
max, concat), enforced by a conditional JSON-schema rule and surfaced in the UI
with the usual required marker.

Backend

  • count counts all rows when the attribute is empty/blank, otherwise counts non-null values of the selected column.
  • Conditional-required rule: attribute is required unless the function is count (validated by Ajv, so an empty attribute on other functions marks the operator invalid).

Frontend

  • The Attribute is a normal optional field for count; the required marker (red *) shows for all other functions. Description hints that an empty attribute counts all rows.

Docs

  • Updated the Aggregate operator reference.

Screenshots

image 2C3CB4D9-7FA6-4098-B7F7-293651EEA7D9

Any related issues, documentation, discussions?

Closes #3142.

How was this PR tested?

Automated (AggregateOpSpec, AggregateOpDescSpec):

  • count with an empty (or null) attribute counts all rows including nulls.
  • count with a column counts only non-null values.
  • getAggregationAttribute / schema propagation / executor tolerate an empty attribute (no input-column lookup; result typed INTEGER).
  • Existing sum/average/min/max/concat/getFinal behavior unchanged.

All Aggregate tests pass: sbt "WorkflowOperator/testOnly *aggregate*".

Manual (UI): verified count + empty Attribute counts all rows (with and without Group By), count + a column counts non-null values, and non-count functions with an empty Attribute are invalid (Run disabled).

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

@github-actions github-actions Bot added feature frontend Changes related to the frontend GUI docs Changes related to documentations common labels Jun 23, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang, @aglinxinyuan, @Ma77Ball
    You can notify them by mentioning @Yicong-Huang, @aglinxinyuan, @Ma77Ball in a comment.

@codecov-commenter

codecov-commenter commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.34%. Comparing base (a1a7eb0) to head (6175d4d).

Files with missing lines Patch % Lines
...it-frame/operator-property-edit-frame.component.ts 40.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5896      +/-   ##
============================================
- Coverage     56.38%   56.34%   -0.05%     
  Complexity     2986     2986              
============================================
  Files          1129     1129              
  Lines         43794    43800       +6     
  Branches       4743     4744       +1     
============================================
- Hits          24693    24678      -15     
- Misses        17650    17665      +15     
- Partials       1451     1457       +6     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 44.59% <ø> (ø) Carriedforward from a1a7eb0
amber 57.26% <100.00%> (-0.06%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 52.30% <ø> (ø)
file-service 62.81% <ø> (ø)
frontend 50.09% <40.00%> (-0.04%) ⬇️
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.20% <ø> (ø) Carriedforward from a1a7eb0
python 90.76% <ø> (ø) Carriedforward from a1a7eb0
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 5 worse · ⚪ 8 noise (<±5%) · 0 without baseline

Compared against main a1a7eb0 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 360 0.22 27,899/32,782/32,782 us 🔴 +23.9% / 🔴 +123.3%
🔴 bs=100 sw=10 sl=64 759 0.463 128,963/160,453/160,453 us 🔴 +7.5% / 🔴 +50.1%
bs=1000 sw=10 sl=64 876 0.535 1,134,172/1,190,297/1,190,297 us ⚪ within ±5% / 🔴 +16.3%
Baseline details

Latest main a1a7eb0 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 360 tuples/sec 392 tuples/sec 786.27 tuples/sec -8.2% -54.2%
bs=10 sw=10 sl=64 MB/s 0.22 MB/s 0.239 MB/s 0.48 MB/s -7.9% -54.2%
bs=10 sw=10 sl=64 p50 27,899 us 22,520 us 12,495 us +23.9% +123.3%
bs=10 sw=10 sl=64 p95 32,782 us 35,899 us 14,784 us -8.7% +121.7%
bs=10 sw=10 sl=64 p99 32,782 us 35,899 us 18,468 us -8.7% +77.5%
bs=100 sw=10 sl=64 throughput 759 tuples/sec 777 tuples/sec 991.49 tuples/sec -2.3% -23.4%
bs=100 sw=10 sl=64 MB/s 0.463 MB/s 0.475 MB/s 0.605 MB/s -2.5% -23.5%
bs=100 sw=10 sl=64 p50 128,963 us 124,399 us 100,929 us +3.7% +27.8%
bs=100 sw=10 sl=64 p95 160,453 us 149,262 us 106,894 us +7.5% +50.1%
bs=100 sw=10 sl=64 p99 160,453 us 149,262 us 114,085 us +7.5% +40.6%
bs=1000 sw=10 sl=64 throughput 876 tuples/sec 873 tuples/sec 1,023 tuples/sec +0.3% -14.4%
bs=1000 sw=10 sl=64 MB/s 0.535 MB/s 0.533 MB/s 0.624 MB/s +0.4% -14.3%
bs=1000 sw=10 sl=64 p50 1,134,172 us 1,140,797 us 983,835 us -0.6% +15.3%
bs=1000 sw=10 sl=64 p95 1,190,297 us 1,203,714 us 1,023,777 us -1.1% +16.3%
bs=1000 sw=10 sl=64 p99 1,190,297 us 1,203,714 us 1,053,883 us -1.1% +12.9%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,555.06,200,128000,360,0.220,27898.75,32782.08,32782.08
1,100,10,64,20,2635.89,2000,1280000,759,0.463,128963.37,160453.27,160453.27
2,1000,10,64,20,22819.99,20000,12800000,876,0.535,1134171.78,1190296.76,1190296.76

@aglinxinyuan

Copy link
Copy Markdown
Contributor

@chenlica, what do you think? Should we support COUNT(*)?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class COUNT(*) support to the Aggregate operator so users can count rows without selecting a column, with backend execution/schema support and a small Aggregate-specific UI rule to disable the Attribute field when count(*) is selected.

Changes:

  • Backend: introduce COUNT_STAR("count(*)"), implement row-count aggregation, and rewrite both COUNT variants to SUM for the global aggregation stage.
  • Backend/validation: make attribute conditionally required via JSON-schema (attribute required for all functions except count(*)).
  • Frontend + docs: disable/clear the Attribute field when count(*) is selected and document the new behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
frontend/src/app/workspace/component/property-editor/operator-property-edit-frame/operator-property-edit-frame.component.ts Disables + clears Aggregate “Attribute” field when count(*) is chosen.
docs/reference/operators/data-cleaning/aggregate/aggregate.md Documents count(*) and clarifies COUNT vs COUNT(*).
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationFunction.java Adds COUNT_STAR("count(*)") enum value for JSON.
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregationOperation.scala Adds COUNT(*) semantics, conditional attribute requirement in schema, and global-stage rewrite logic.
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpExec.scala Updates executor initialization to tolerate COUNT(*) without input-column lookup.
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDesc.scala Updates schema propagation to tolerate COUNT(*) without input-column lookup.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpSpec.scala Adds unit/integration coverage for COUNT(*) behavior and final-stage rewrite.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/aggregate/AggregateOpDescSpec.scala Adds schema-propagation coverage for COUNT(*) output typing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@chenlica

Copy link
Copy Markdown
Contributor

@mengw15 Please review it.

@chenlica chenlica requested review from mengw15 and removed request for chenlica June 23, 2026 07:02
@Yicong-Huang

Copy link
Copy Markdown
Contributor

maybe a bit more to ask, but can we hide the Attribute input box when count(*) is selected?

alternatively, you can add a * value in Attribute so that its semantic is func: count, value: "*".

@Yicong-Huang

Yicong-Huang commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@chenlica, what do you think? Should we support COUNT(*)?

This better to be discussed on issue #3142, not PR.

@tanishqgandhi1908

Copy link
Copy Markdown
Contributor Author

maybe a bit more to ask, but can we hide the Attribute input box when count(*) is selected?

alternatively, you can add a * value in Attribute so that its semantic is func: count, value: "*".

Thanks! Quick reasoning on both, and happy to change either:

I disabled (greyed out) the Attribute instead of hiding it so the rows don't change height/jump when switching functions, the layout stays consistent. But hiding is a one-line change can do if team prefers it.

The Attribute box is a single column-picker shared by every row and filled from the input schema. If we add a * option to it, that * would also show up for sum, min, max, etc, since the dropdown is the same for all functions. Limiting * to count only would need extra per-row conditional logic.

@Yicong-Huang

Copy link
Copy Markdown
Contributor

yeah but if we don't hide/remove/reset the value input box, it may have this experience:

user selects Min, age, then switch to Count(*), age which can be confusing. we want to avoid that situation.

@tanishqgandhi1908

tanishqgandhi1908 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

That case is already handled, when you switch to count(*) we reset the attribute, so it shows count(*) with an empty box, not count(*), age. The change to hide is minor, do let me know what you think!

@Yicong-Huang

Copy link
Copy Markdown
Contributor

That case is already handled, when you switch to count(*) we reset the attribute, so it shows count(*) with an empty box, not count(*), age. The change to hide is minor, do let me know what you think!

ok so it is doing reset already. that will do. to hide it is optional and I will leave you to decide for the best experience! thanks

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments in line

@chenlica

Copy link
Copy Markdown
Contributor

@mengw15 Please review it after the pass of @Yicong-Huang .

@zuozhiw

zuozhiw commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Same comment as yicong, we should do function count and * as a special column you can select there. honestly I think it will be eaiser if we make attribute an optional parameter and if you don't select anything it means count(*)

@tanishqgandhi1908

Copy link
Copy Markdown
Contributor Author

According to comments and suggestions from both of you ( @zuozhiw and @Yicong-Huang )
I was then thinking of implementing and updating the PR for count() as a single count with an optional attribute. If the attribute is empty -> count(), if the attribute is a column -> count(column).

Would this be okay and make sense?

@zuozhiw

zuozhiw commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

@tanishqgandhi1908 the new proposal sounds good thx

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

@tanishqgandhi1908

Copy link
Copy Markdown
Contributor Author

Updated the PR and its description accordingly!

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one more test to add

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common docs Changes related to documentations feature frontend Changes related to the frontend GUI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support COUNT(*) for Aggregate COUNT function

7 participants